White Wine Quality Exploration by Ying Xiong

The dataset we are looking at was presented by Cortez et al. (see reference below). It contains a large collection (about 5000) of white wines whose quality was evaluated by experts, together with various physical or chemical properties such as density, pH, and alcohol.

The goal of this project is to analyze and understand this dataset. In particular, we would like to find answers to the following questions:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Univariate Plots and Analysis

What is the structure of your dataset?

The output below is a simple summary of the data.

There are 4898 white wine observations in the dataset with 13 variables in total, including an index variable (named “X”), the “quality” variable, and 11 other variables describing the chemical properties of the wine.

The quality of the wine is an integer variable with a minimum of 3.0 and a maximum of 9.0, a median of 6.0, and a mean of 5.878.

All the chemical property variables are floating-point numbers. They are measured in different units and therefore lie in widely different ranges. For example, the chlorides variable has a small range from 0.009 to 0.346, while the total.sulfur.dioxide variable has a large range from 9.0 to 440.0.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

What is/are the main feature(s) of interest in your dataset?

The main features of interest in the dataset are alcohol and quality. I suspect alcohol and some combination of other variables can be used to build a predictive model for the wine quality.

Below we plot the histogram for the quality variable. The variable is discrete, but its histogram has a roughly bell-shaped form, close to a normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Similarly, we plot the histogram for the alcohol variable. The distribution is plotted at different bin widths, so that we can look at the data at different “resolutions”. At the coarse level (binwidth=1), we see that it follows a skewed distribution with the largest number of samples in [9, 10], followed by [10, 11], then [11, 12], etc. At the fine level (binwidth=0.1), we see more irregularities in the distribution, with multiple spikes, for example at [9.0, 9.1], [9.5, 9.6], and [10.0, 10.1].

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20
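The effect of the bin width described above can be reproduced with a small base R sketch. It is only an illustration under stated assumptions: the values are synthetic (not the wine data), `bin_counts` is a hypothetical helper name, and the plotting library used for the original figures is not known from this report.

```r
# Synthetic, right-skewed values standing in for alcohol (NOT the wine data),
# rounded to one decimal place like the values in the summary above.
set.seed(1)
alcohol <- round(pmin(8 + rgamma(2000, shape = 3, rate = 1.2), 14.2), 1)

# bin_counts is a hypothetical helper: histogram counts for one bin width.
bin_counts <- function(x, binwidth) {
  breaks <- seq(floor(min(x)), ceiling(max(x)), by = binwidth)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

coarse <- bin_counts(alcohol, 1)    # coarse resolution
fine   <- bin_counts(alcohol, 0.1)  # fine resolution

# Both histograms count every sample exactly once; only the grouping differs.
print(coarse)
print(length(fine))
```

Calling `hist()` without `plot = FALSE` draws the corresponding figure instead of only returning the counts.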

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Features such as residual.sugar, sulphates, pH, chlorides will likely contribute to the wine quality and will support our investigation.

Did you create any new variables from existing variables in the dataset?

I created an ordered factor version of quality from its original integer version. Furthermore, I grouped the wine quality into 3 buckets [(3,4,5), (6), (7,8,9)] so that we get more samples in each bucket for better analysis.

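# Ordered factor version of quality (levels 3 < 4 < ... < 9)
df$quality.ordered <- as.ordered(df$quality)
# Three buckets via right-closed intervals: (2,5], (5,6], (6,10]
df$quality.bucketed <- cut(df$quality, c(2, 5, 6, 10))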
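To see which scores land in which bucket, the same two calls can be applied to a toy vector holding each possible score once. This is a minimal sketch; the vector `quality` and the readable labels in the last step are assumptions added for illustration and are not part of the original analysis.

```r
# One value per possible quality score (toy input, not the dataset).
quality <- 3:9

# Same break points as above: intervals are right-closed, so 5 falls in (2,5].
bucketed <- cut(quality, breaks = c(2, 5, 6, 10))
ordered_q <- as.ordered(quality)

print(data.frame(quality, bucketed))
print(table(bucketed))

# Optional: hypothetical readable labels for the same three buckets.
labelled <- cut(quality, breaks = c(2, 5, 6, 10),
                labels = c("low", "medium", "high"))
print(labelled)
```

Because `cut()` uses right-closed intervals by default, the break points 2, 5, 6, 10 give exactly the grouping (3,4,5), (6), (7,8,9).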

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

During the investigation, I found that the chlorides variable has an unusual distribution. From the histogram shown below, we see that the majority of samples lie in the range [0, 0.1] in a roughly normal shape, but a small number of outliers lie far beyond this range (up to 0.346), which indicates a long-tailed distribution.

In order to better visualize this distribution, we tried two approaches: (1) cut off the samples that are beyond 0.1 and “zoom in” on those in the “regular range”; (2) plot the distribution on a log10 scale.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
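The two approaches for a long-tailed variable can be sketched in base R as follows. The data here is synthetic (a lognormal sample chosen only to have a long right tail), and the bin settings are assumptions, not the ones used for the original figures.

```r
# Synthetic long-tailed values standing in for chlorides (NOT the wine data).
set.seed(2)
chlorides <- rlnorm(1000, meanlog = log(0.045), sdlog = 0.35)

# Approach 1: drop samples beyond 0.1 and histogram only the "regular range".
regular <- chlorides[chlorides <= 0.1]
h_zoom <- hist(regular, breaks = seq(0, 0.1, by = 0.005), plot = FALSE)

# Approach 2: keep every sample, but histogram log10(chlorides) instead.
h_log <- hist(log10(chlorides), breaks = 30, plot = FALSE)

# Approach 1 discards the tail; approach 2 keeps it but compresses it.
cat("samples kept after cut-off:", length(regular), "of", length(chlorides), "\n")
cat("samples in log10 histogram:", sum(h_log$counts), "\n")
```

The cut-off view shows the bulk in detail but hides the outliers, while the log10 view keeps all samples at the cost of a less intuitive axis.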

Fixed acidity

The normal range of fixed.acidity is 5.0 to 10.0. A small number of outliers fall outside this range: the observed values span 3.8 to 14.2.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

Residual sugar

The normal range of residual.sugar is 0.0 to 20.0. Again, there are a few outliers with values much larger than this range (up to 65.8).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Density

The normal range of density is 0.99 to 1.00. Most of the samples are within this range, with a few out of the range but not significantly larger (up to 1.039).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Sulphates

The normal range of sulphates is 0.2 to 1.0. Almost all samples are within this “normal range”, with a few exceptions just outside of this range but not far away.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Bivariate Plots and Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

We make a box plot for alcohol level for each different quality below.

We can see that there is a clear dependency between alcohol and quality: the alcohol level tends to be high for both low quality and high quality wines, but low for medium quality wines. I find this a very interesting observation.

Also, we see that the highest quality wine (9) has quite concentrated alcohol level, in other words, the variance of alcohol level for wine of this quality is low. Later I realized that this is because there are very few samples (5 in total) with quality score being 9, and therefore the small variance could partly be attributed to lack of data.

## Correlation:  0.4355747

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

We found that there is a weak inverse correlation between chlorides and wine quality. From the figure below, we see that apart from the lowest and highest qualities, where we have a relatively small number of data points, the wines tend to have a higher quality when their chlorides level is lower.

## Correlation:  -0.2099344

What was the strongest relationship you found?

We compute the correlation of quality against each individual feature in the dataset, and print the result table below. We see that alcohol has the strongest correlation (0.436) with quality, and density has the strongest negative correlation (-0.307). I did not expect the latter before analyzing the dataset.

##        fixed.acidity     volatile.acidity          citric.acid 
##         -0.113662831         -0.194722969         -0.009209091 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##         -0.097576829         -0.209934411          0.008158067 
## total.sulfur.dioxide              density                   pH 
##         -0.174737218         -0.307123313          0.099427246 
##            sulphates              alcohol 
##          0.053677877          0.435574715
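A table like the one above can be produced by applying `cor()` to each feature column in turn. The sketch below assumes a small synthetic data frame named `toy` (a hypothetical stand-in, not the wine data), so the printed numbers will not match the table.

```r
# Synthetic stand-in data frame (NOT the wine data); column names mirror the report.
set.seed(3)
n <- 500
toy <- data.frame(
  alcohol   = rnorm(n, mean = 10.5, sd = 1.2),
  chlorides = rnorm(n, mean = 0.045, sd = 0.01)
)
toy$density <- 1.01 - 0.0015 * toy$alcohol + rnorm(n, sd = 0.001)
toy$quality <- round(6 + 0.5 * (toy$alcohol - 10.5) + rnorm(n, sd = 0.8))

# Pearson correlation of every feature column with quality.
features <- setdiff(names(toy), "quality")
cors <- sapply(toy[features], function(col) cor(col, toy$quality))

# Sort by absolute value so the strongest relationships come first.
print(cors[order(abs(cors), decreasing = TRUE)])
```

Sorting by absolute value puts strong negative correlations (such as density in the report) next to strong positive ones.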

Quality v.s. density

There is a negative correlation between density and quality. This is partly exhibited in the box plot visualization, e.g. the highest quality samples have the lowest density quantiles. We also see that the two density outliers (values larger than 1.005) both have the medium quality value 6.

## Correlation:  -0.3071233

Quality v.s. sulphates

There is a very small positive correlation between sulphates and quality, which is also supported by the box plot. There is no significant change in the sulphates quantiles or median among the different wine qualities.

## Correlation:  0.05367788

Quality v.s. fixed acidity

There is a small negative correlation between fixed acidity and quality, which is supported by the box plot. As the quality of the wine increases (from left to right), the median and quantiles of fixed acidity slightly decrease, with the exception of the highest quality wines, which in fact have a relatively large fixed acidity compared to the others.

## Correlation:  -0.1136628

Density v.s. alcohol

We can see that there is a strong negative correlation between density and alcohol, mostly because alcohol itself has a smaller density than water (which makes up the majority of the wine). The scatter plot confirms this observation, and also shows two outliers with large density but not extraordinary alcohol level.

## Correlation:  -0.7801376

Volatile acidity v.s. fixed acidity

I expected some correlation between volatile acidity and fixed acidity, because they are somehow chemically related (according to my very limited chemistry knowledge). The visualization below shows that the correlation is in fact very low, which means the two properties are not closely related. For example, it is normal for a wine sample with high fixed acidity to have either relatively low or relatively high volatile acidity.

## Correlation:  -0.02269729

“Non-free” sulfur dioxide v.s. free sulfur dioxide

I plotted the “non-free” sulfur dioxide (computed as the difference of total sulfur dioxide and free sulfur dioxide) versus the free sulfur dioxide. The results suggest that there is a weak correlation: for samples with high level of free sulfur dioxide, their level of “non-free” sulfur dioxide is usually also high (although not always).

## Correlation:  0.2635373
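The derived “non-free” variable is a plain column difference. The sketch below uses only the first ten rows shown in the structure summary earlier in this report, so its correlation is not the dataset-wide value printed above; the names `free`, `total`, and `non_free` are illustrative assumptions.

```r
# First ten rows of free/total sulfur dioxide, copied from the str() output
# earlier in this report (ten rows only, not the full dataset).
free  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# "Non-free" sulfur dioxide is the part of the total that is not free.
non_free <- total - free
print(non_free)

# Pearson correlation on these ten rows only (illustrative, not the reported value).
print(cor(non_free, free))
```

On the full data frame the same idea would be a single assignment of the difference of the two columns to a new column.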

Sulphates v.s. total sulfur dioxide

As with “volatile acidity v.s. fixed acidity”, the correlation between sulphates and total sulfur dioxide is very small. The visualization also confirms this: the total sulfur dioxide level of a sample has very little predictive power for the sample’s sulphates level.

## Correlation:  0.05921725

Multivariate Plots and Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

We plot chlorides against sulphates in the figure below, grouped and colored by wine quality. From this plot we see that, conditioned on the wine quality group, chlorides is mostly independent of (roughly constant with respect to) sulphates. We also see that low quality wine tends to have a higher chlorides level while high quality wine tends to have a lower chlorides level, even though sulphates span roughly the same range for each quality group.

We also added the scatter plot of all data points, and we can see the variation of chlorides given sulphates is quite large, but the general trend is visible: low quality wines (red points) tend to have larger chlorides than high quality wines (blue points).

Were there any interesting or surprising interactions between features?

One interesting and somewhat surprising fact I found is that the relationships between most pairs of features are independent of wine quality. Take the following plot for example: when plotting density against alcohol, grouped by quality, we see that the groups mostly follow the same decreasing relationship, and the curves are very close to or indistinguishable from each other. After thinking about it more, I believe it is reasonable that the relationship of one physical/chemical property to another is mostly consistent and independent of quality, because it is usually governed by the laws of physics/chemistry rather than by human taste. For example, the more alcohol a wine contains, the lighter (lower density) it will be, because alcohol has a smaller density than water (which makes up most of the wine), and this holds regardless of how the wine tastes.
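One way to check such a claim numerically is to fit the same linear model separately within each quality bucket and compare the slopes. The sketch below is an assumption-laden illustration: the data is synthetic and built so that every bucket follows the same density-alcohol law, so it shows only the mechanics of the check, not a result about the wine data.

```r
# Synthetic data (NOT the wine data): one density-alcohol law shared by all buckets.
set.seed(6)
toy <- data.frame(
  bucket  = factor(rep(c("low", "medium", "high"), each = 100)),
  alcohol = runif(300, min = 8, max = 14)
)
toy$density <- 1.010 - 0.0015 * toy$alcohol + rnorm(300, sd = 0.0005)

# Fit density ~ alcohol within each bucket and keep the slope.
slopes <- sapply(split(toy, toy$bucket), function(g) {
  coef(lm(density ~ alcohol, data = g))[["alcohol"]]
})

# Similar slopes across buckets mean the relationship does not depend on the bucket.
print(slopes)
```

If the slopes (and intercepts) differed clearly between buckets, the curves in the grouped plot would separate instead of overlapping.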

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

In this plot we draw the histogram and density of alcohol level. The binwidth of the histogram is set to 0.2, and the density is estimated with a Gaussian kernel with default adjust=1.

From the visualization we see that the distribution of alcohol level in the sample set is asymmetric (not a normal distribution). More specifically, its mass is concentrated toward the lower end (it is right-skewed): there are more wines with a lower alcohol level (9 to 10) than with a higher alcohol level (11 to 12).
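A histogram with bin width 0.2 and a Gaussian kernel density estimate with `adjust = 1` can be computed in base R as sketched below. The values are synthetic (not the wine data), the break range 8 to 14.4 is an assumption chosen to cover them, and the comparison of mean and median is a simple skewness check that the report itself does not perform.

```r
# Synthetic, right-skewed values standing in for alcohol (NOT the wine data).
set.seed(4)
alcohol <- pmin(8 + rgamma(2000, shape = 3, rate = 1.2), 14.2)

# Histogram with bin width 0.2 and a Gaussian kernel density estimate.
h <- hist(alcohol, breaks = seq(8, 14.4, by = 0.2), plot = FALSE)
d <- density(alcohol, kernel = "gaussian", adjust = 1)

# For a right-skewed sample the mean typically exceeds the median.
cat("mean:", mean(alcohol), " median:", median(alcohol), "\n")

# To draw both on one density scale:
# plot(h, freq = FALSE); lines(d)
```

Plotting the histogram with `freq = FALSE` puts the bars on the same density scale as the kernel estimate, so the two can be overlaid directly.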

Plot Two

## Correlation:  0.4355747

Description Two

In this plot we draw the quality of wine v.s. the alcohol level. We use a scatter plot with alpha=0.5 plus some jittering to visualize the actual distribution of alcohol at each quality level. In addition, we plot the 10% and 90% quantiles (blue bars) together with the median (red cross) to better show the general trend of the data.

From the exploration above, we found that alcohol is the feature with the largest correlation (0.436) to wine quality among all the given features. We can see that for wine samples of quality 5 or larger, the median alcohol level grows with quality (the red cross drifts rightwards). However, we also see that low quality wines (3 and 4) tend to have a higher alcohol level.

I find this observation very interesting and also reasonable: people usually like the taste of “good alcohol” in wine, the kind generated by fermenting fruit for a long enough time; but there are also manufacturers who try to artificially boost the alcohol level of their wine, in which case the tasting experts (and a lot of ordinary people) will be able to tell.
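The per-quality summary markers in this plot (10% and 90% quantiles plus the median) can be computed with `quantile()` applied to each quality group. This is a minimal base R sketch on synthetic data; the group sizes and alcohol values are invented for illustration and are not the dataset's.

```r
# Synthetic stand-in (NOT the wine data): quality scores with invented group sizes.
set.seed(5)
quality <- rep(3:9, times = c(5, 30, 150, 220, 90, 30, 5))
alcohol <- rnorm(length(quality), mean = 10.5, sd = 1.2)

# 10%, 50% (median), and 90% quantiles of alcohol within each quality level.
stats <- t(sapply(split(alcohol, quality), quantile, probs = c(0.1, 0.5, 0.9)))
print(round(stats, 2))
```

Each row of `stats` corresponds to one quality level; with very small groups the quantiles are unstable, which matches the earlier caveat about the few samples with quality 9.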

Plot Three

Description Three

In this plot we draw a scatter plot of alcohol versus residual sugar, colored by the wine quality and super-imposed with the median curve.

From this plot we can see some distinct phenomena when combining two different features to make a better prediction of wine quality. For example, in the residual sugar range below 10, there is a clear trend: the higher the alcohol level, the better the wine quality tends to be. Beyond that residual sugar level, all wines tend to have low alcohol, and its distinguishing power diminishes. This effect is visible not only in the median statistics but also in the scatter plot: in the left half of the plot (low residual sugar), blue points (high quality wines) tend to sit higher (larger alcohol level) than the green and red points.


Reflection

I have several take-home messages from this project: